#https://bookdown.org/yihui/rmarkdown/html-document.html
#install.packages('knitr', ependencies = TRUE)
#install.packages("devtools", lib="~/R/lib")
library(DT)
#devtools::session_info()The main goal of MEBS is capture with a single value the importance of complex metabolic pathways or biogeochemical cycles in a large omic datasets (either genomes or metagenomes). The algortithm has been thoroughly tested with the sulfur cycle, but currently other cycles are also supported. Use the script mebs.pl to avoid reading the manual and score your own genome/metagnome in terms of biogeochemical cycles. All that is required is a directory containing peptide FASTA files of encoded proteins/fragments with .faa extension.
The MEBS software is available as an open-source package distributed from a GitHub repository. Thus, the natural way of installing it is by cloning the repository via the following commands:
git clone https://github.com/eead-csic-compbio/metagenome_Pfam_score
#Alternatively, a ZIP file can be downloaded and then unpacked:
unzip metagenome_Pfam_score-master.zipBefore you start, make sure you have hmmserch installed, v3.1b1 or greater, otherwise the program will generate errors, see issue 1
perl mebs.pl
usage: mebs.pl [options]
-help Brief help message
-input Folder containing FASTA peptide files (.faa) (required)
-type Nature of input sequences, either 'genomic' or 'metagenomic' (required)
-fdr Score cycles with False Discovery Rate 0.1 0.01 0.001 0.0001 (optional, default=0.01)
-cycles Show currently supported biogeochemical cycles
-comp Compute the metabolic completeness (optional)
The following biogeochemical cycles are ready to use with MEBS:
perl mebs.pl -cycles
# Available cycles:
sulfur
carbon
oxygen
iron
nitrogenThe main modalities of MEBS are
Mebs assumes that your sequencing data is in fasta format (.faa) extension in an especific directory. Example of the input data can be foun in test_genomes/ or test_metagenomes/ directories
head -3 test_genomes/*.faa
==> test_genomes/Archaeoglobus_profundus_DSM_5631.faa <==
>WP_012766394.1 hypothetical protein [Archaeoglobus profundus]
MGSQEVGRIEEEVVEERRQEEEEIDEEEATGSALLTAEEFDKKIEEIKAKVREAVQEALANTLADLLEKEEKEEKRKDET
KAEELPKCPKNLSWLKKMFEILPLDILRNSKLWRYRHCVWALQEAEKEAEKFRNS
==> test_genomes/Enterococcus_durans.faa <==
>WP_000053907.1 MULTISPECIES: replication control protein PrgN [Bacilli]
MSLKNYVYSHPVNVFIIGRLGLSVEVFCELYGFKQGTLSSWVTREKTVASLPIEFLHGLSLASGETMDQVYDCLCVLEQE
YIEYKIANELRKRKKYIQTo run MEBS you only need to specifyt the input folder and the type of data (either genomic or metagenomic). The latter is required for MEBS to allocate the pre-computed entropies for each type of data considering the fragmentary nature of the metagenomic sequences.
perl mebs.pl -input test_genomes/ -type genomic
sulfur carbon oxygen iron nitrogen
Enterococcus_durans.faa -0.063 0.284 0.883 0.214 3.044
Archaeoglobus_profundus_DSM_5631.faa 11.434* 24.834* 1.493 0.765 6.873The scores that meets the criteria of specific FDR are shown in asterisc, yet the score will be the same regardless of the FDR that is used. If the Score if greater or equal to the FDR, then an asterisc will be shown in the output. In the case of using the default FDR (0.01), more false positive will be obtained, for example the genome Archaeoglobus profundus a well known microorgnism involved in the S-cycle, could seem to have a CH4 metabolism by using a default FDR,however if we increase to FDR 0.001, the C cycle asterisc is gone and only the S-cycle ramain. Therefore, we recomend a more restrictive FDR in order to eliminate false positives.
perl mebs.pl -input test_genomes/ -type genomic -fdr 0.001
sulfur carbon oxygen iron nitrogen
Enterococcus_durans.faa -0.063 0.284 0.883 0.214 3.044
Archaeoglobus_profundus_DSM_5631.faa 11.434 24.834* 1.493 0.765 6.873If you attempt to benchmark your own metabolism, we recomend to add your own FDR values in this config file at the end of this file. In a 16.04 Ubuntu system, 16Gb RAM, intel Inside i7 the time to run the scritpt in the example folder is less than 20 seconds.
real 0m14.183s
user 0m22.961s
sys 0m0.865sIf you want to compare your results with a dataset of 2107 non-redundant genomes plese check the table below.
scoresGen<-read.table("MEBS_values.tab", row.names = 1, header=TRUE, sep="\t")
datatable(scoresGen)Also, If you want to perform your own analysis and comparisions (i.e clustering your data to see which is the genome with more metabolic relateness in terms of metabolic capabilities) you can have acces to the (Table 6) 6)[ftp://parrot.genomics.cn/gigadb/pub/10.5524/100001_101000/100357/sup-tables-csv/S6.GenDataset_SS_values.csv]: Described in the MEBS database De Anda V , Zapata-Penasco I, Poot-Hernandez AC et al. Multigenomic Entropy Based Score (MEBS): the molecular reconstruction of the sulfur cycle. GigaScience Database 2017. http://dx.doi.org/10.5524/100357.
In the case of using metagenomic data metagenomic data the mean size length (MSL) and the size of the allocated entropies (MSLbin) is especify as warning. If you redirect the output of the script to a file this information will not be printed.
perl mebs.pl -input test_metagenomes -type metagenomic
# mebs.pl -input test_metagenomes -type metagenomic -fdr 0.01 -comp 0
# Computing Mean Size Length (MSL) ...
# 4511045.3_metagenome.faa MSL=32 MSLbin=30
# 4440966.3_metagenome.faa MSL=175 MSLbin=150
sulfur carbon oxygen iron nitrogen
4511045.3_metagenome.faa -2.295 1.790 5.412 2.745 13.024
4440966.3_metagenome.faa 5.817 8.804 1.178 4.579 11.697As MEBS needs to compute the MSL of each metagenome the time will depend on the size of the input sample. In the test_metagenomes/ directory there are only two metagenomes with size of 2,8M and 6,5M respectively. The computation time will depend on the size and the number of metagenomes in the input forlder. In the example it takes less than 40 seconds to finish.
real 0m36.736s
user 1m40.328s
sys 0m2.599sCreate a file with the id’s of interest. As example, we provide a list of microbialites (microbial mats and stromatolites) and other environments publicly available from MG-RAST. Please have a look at the table containing metadata of the metagenomic dataset used in De Anda et al., 2017
| MG-RAST id | Type and origin | Reference |
|---|---|---|
| 4449591.3 4449590.3 | Stromatolites from Highborne Cay, Bahamas | (Khodadad & Foster, 2012) |
| 4445126.3 4445129.3 | Polar microbial mats (Antartic) | (Varin et al., 2012) |
| 4442467.3 4442466.3 4441363.3 4441347.3 | Freshwater microbial mats from Cuatro Cienegas Basin (CCB | (Bonilla-Rosso et al., 2012; Peimbert et al., 2012) |
| 4443746.3 4443747.3 4443762.3 4443749.3 4443750.3 | HMicrobial mats from Yellowstone National Park | (Bhaya et al., 2007) |
| 4454153.3 | Purple sulfur bacteria biofilm | (Wilbanks et al., 2014) |
| 4440060.4 4440067.3 | Stromatolite and thrombolite from CCB | (Breitbart et al., 2009; Desnues et al., 2008) |
| 4487624.3 4487625.3 | Hydrothermal vents Taiwan | (Tang et al., 2013) |
| 4441138.3 4441137.3 | Acid Mine Dreinage (AMD) mats | (Jiao et al.,2011) |
| 4449206.3 | Hydrothermal vent Andes Colombia | (Jiménez et al., 2012) |
| 4491734.3 | Polar cryconite | (Edwards et al., 2013) |
| 4744007.3 4744008.3 4744009.3 4744010.3 4744011.3 4744012.3 4744013.3 4744014.3 4744015.3 4744016.3 4744017.3 4744018.3 | Microbial mats from Churince CCB | (De Anda et al., in review) |
We have also included freshwater microbrialites from Pavilion Lake, Clinton Creek described in Allen-White III et al., 2016, Allen-White III., et al., 2016, Respectively, hypersaline from marine environements described in Ruvindy et al., 2016
head
4449591.3
4449590.3
4445126.3
4445129.3
4442467.3
4442466.3
4441363.3
4441347.3This can take an hour depending on your download bandwith
cd test_mats
time for line in `cat IDmats.txt`; do wget "http://api.metagenomics.anl.gov/1/download/mgm$line?file=350.1" -O $line.faa ; doneUse restrictive FDR and save the output in a tabular file
perl mebs.pl -input test_mats -input test_mats -type metagenomic -fdr 0.0001 -comp > scores.tab
# mebs.pl -input test_mats -type metagenomic -fdr 0.0001 -comp 1
# Computing Mean Size Length (MSL) ...
# 4441588.3.faa MSL=146 MSLbin=150
# 4442467.3.faa MSL=99 MSLbin=100
# 4443749.3.faa MSL=210 MSLbin=200
# 4441137.3.faa MSL=209 MSLbin=200
# 4441571.3.faa MSL=213 MSLbin=200
# 4442466.3.faa MSL=75 MSLbin=60
# 4493725.3.faa MSL=223 MSLbin=200
........Lets have a look at the output
According to the restrictive FDR, used (represended as steriscs),few metagenomes are over-represented in the Oxygen and Nitrogen Cycle. However, we know according to our previous benchmark with the SS, we know that depending on the MSL of the metagenome, a score above 7, generally is over-represented in the S-cycle.
To avoid long names, lets parse the file and remove the faa extension
sed 's/\.faa//g' scores.tab | sed 's/\*//g' | cut -f 1,2,3,4,5,6 > scores2plot.tabThe file to plot should look like this
head scores2plot.tab
sulfur carbon oxygen iron nitrogen
4441588.3 -2.367 -1.331 0.944 1.173 1.801
4442467.3 3.098 8.511 7.088 8.758 17.550
4443749.3 3.520 10.772 8.047 3.474 14.217
4441137.3 5.159 9.276 7.634 4.824 17.251
4441571.3 8.020 26.993 7.378 8.700 19.469
4442466.3 5.359 13.446 7.593 7.638 18.965
4493725.3 9.547 17.864 7.481 9.279 18.563
4445129.3 5.672 16.691 7.727 8.394 18.842
4443750.3 2.660 3.959 7.052 3.709 10.587
........Use mebsResults.py script to plot the results
python3 mebsResults.py scores2plot The above will generate a png file containing the Score values for each input metagenome. We have also included 12 metagenomes from CCB named from 1-12.
scores2plot
Use the script completeness.py to create a heatmap with the percentage of completeness
python3 completeness.py scores.tab scores2plot
If you want to customize your heatmap, see the following notebook
Futhermore, you can use ClustVis, to visualizate your data. Just upload the scores2plot.tab file, transpose the data and generate the PCA
PCA
To compare your data with the maximum scores that you can obtain from the entropy data, have a look at the following data If you sare computing MEBS in genomes compare your results with the row “Genomic data”. In the case that you are computing MEBS in metagenomes see the corresponding MSL and MSLbin you compare your results.
| sulfur | methane | oxygen | iron | nitrogen | |
|---|---|---|---|---|---|
| Genomic data | 16.018 | 85.332 | 10.703 | 10.464 | 22.079 |
| 30 | 13.676 | 84.503 | 10.438 | 8.843 | 20.642 |
| 60 | 16.818 | 85.347 | 11.253 | 9.567 | 22.148 |
| 100 | 15.566 | 85.221 | 9.965 | 10.676 | 21.43 |
| 150 | 15.848 | 84.81 | 10.152 | 10.316 | 21.379 |
| 200 | 15.887 | 84.765 | 10.463 | 9.832 | 21.938 |
| 250 | 16.031 | 85.057 | 10.387 | 10.215 | 21.853 |
| 300 | 15.929 | 84.942 | 10.569 | 10.284 | 21.968 |
Below we summarize the internal steps performed with MEBS in the basic mode using the main script mebs.pl
MEBS flowchart basic mode
The following external packages are required if you want to benchmark your own metabolic pathway. Interproscan and hmmsearch are needed in order to annotate Pfam domains within peptide sequences. The rest of packages are needed to run the full pipeline, which comprises four steps.
For more advanced uses a manual is provided. The required input data are:
These inputs are processed in order to train a classifier which internally uses Pfam domains.
As seen above, genomes or metagenomes provided by the user can then be scored with the trained classifier. Once a classifier has been trained, such as the Sulfur cycle, steps 1 and 3 can be skipped.
MEBS flowchart advance mode
Planned feature improvements are publicly catalogued at the main MEBS development site on github. Bug reports and problems using MEBS are welcome on the issues tracker. We prefer posting to the issue tracker over email as these posts are searchable by other users who may experience the same problems.
We’d love to hear from you. Any comments and suggestions can be sent to Valerie De Anda valdeanda@ciencias.unam.mx
If you find this software usefull please cite us as: